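First, install and load the necessary packages: dplyr for data
manipulation, tidyverse (which includes ggplot2) for static plots, and
plotly for interactive visualization. Loading tidyverse also attaches dplyr, so the separate library(dplyr) call is redundant but harmless.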
# install.packages("dplyr")
# install.packages("tidyverse")
# install.packages("plotly")
library(dplyr)
library(tidyverse)
library(plotly)
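Before calling library(), it can help to check which packages are actually installed. The following is a minimal sketch, not part of the original workflow; the `required` vector is a hypothetical name introduced here for illustration.

```r
# Hypothetical list of the packages this walkthrough loads.
required <- c("dplyr", "tidyverse", "plotly")

# requireNamespace() returns FALSE instead of stopping when a package is absent.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}
```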
Reading and Viewing the Titanic Dataset
We start by reading the Titanic dataset from a CSV file into a data frame.
titanic_data <- read.csv("Titanic-Dataset.csv")
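By default, read.csv() keeps blank text fields as empty strings rather than NA. As an optional variation (an assumption, not what this walkthrough does), the `na.strings` argument can turn those blanks into NA at read time. The sketch below uses a small made-up CSV in place of the real file.

```r
# Toy CSV standing in for Titanic-Dataset.csv (values are illustrative).
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,C
3,,,"

# Default read: blank character fields stay as "" rather than NA.
default_read <- read.csv(text = csv_text)

# With na.strings, blank fields become NA in character columns too.
na_read <- read.csv(text = csv_text, na.strings = c("", "NA"))

sum(default_read$Cabin == "")  # blanks counted as empty strings
sum(is.na(na_read$Cabin))      # the same blanks, now counted as NA
```

Note that the cleaning steps in this document filter on `Embarked != ""`, so they assume the default behaviour.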
You can view the dataset in a spreadsheet-like format:
View(titanic_data)
Summary Statistics and Data Structure
To understand the dataset better, we run summary statistics and examine its structure.
summary(titanic_data)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
str(titanic_data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
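summary() reports NA counts only for numeric columns such as Age, while blank text values (as in Cabin above) go unreported. A minimal sketch for counting both kinds of missing value per column, using a made-up data frame with illustrative values:

```r
# Toy data frame mirroring the mix of NA and "" seen above (illustrative values).
df <- data.frame(
  Age = c(22, NA, 26, NA),
  Cabin = c("", "C85", "", "C123"),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Count values that are NA or, for character columns, empty strings.
missing_counts <- vapply(
  df,
  function(col) sum(is.na(col) | (is.character(col) & col == "")),
  integer(1)
)
missing_counts
```

Running the same function over titanic_data would show which columns need attention before cleaning.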
Data Cleaning
1. Fill Missing Age Values
We replace missing values in the Age column with the median age.
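# Replace missing ages with the median of the observed ages
titanic_data <- titanic_data %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))
# Drop rows where the embarkation port is blank
titanic_data <- titanic_data %>%
  filter(Embarked != "")
# Drop the Cabin column
titanic_data <- titanic_data %>%
  select(-Cabin)
# Convert categorical columns to factors
titanic_data <- titanic_data %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))
# Lowercase all column names (Age becomes age, Pclass becomes pclass, and so on)
names(titanic_data) <- tolower(names(titanic_data))
# Save the cleaned data for later use
write.csv(titanic_data, "cleaned_titanic_data.csv", row.names = FALSE)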
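To see what the cleaning steps do, it can be useful to run them on a tiny data frame and check the result. The sketch below is illustrative only: `raw` and `cleaned` are hypothetical names and the values are made up, not taken from the real dataset.

```r
library(dplyr)

# Toy stand-in for the raw data (illustrative values, not real passengers).
raw <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass = c(3, 1, 3, 2),
  Age = c(22, NA, 26, 40),
  Cabin = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

# Same sequence of steps as above, chained into one pipeline.
cleaned <- raw %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  filter(Embarked != "") %>%
  select(-Cabin) %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Embarked = as.factor(Embarked))
names(cleaned) <- tolower(names(cleaned))

# Sanity checks: no missing ages, no blank ports, Cabin is gone.
stopifnot(!anyNA(cleaned$age),
          all(cleaned$embarked != ""),
          !("cabin" %in% names(cleaned)))
str(cleaned)
```

Note that the median is computed before the blank-port rows are removed, matching the order of the steps above.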
Overview
The Titanic dataset contains detailed information about the passengers aboard the Titanic, including their age, class, fare, survival status, and more. Through these visualizations, we aim to uncover insights into the factors that influenced survival and the demographic composition of the passengers.
Survival Count Bar Plot
This plot displays the count of passengers who survived versus those who did not.
ggplot(titanic_data, aes(x = survived)) +
  geom_bar() +
  xlab("Survived") +
  ylab("Count") +
  ggtitle("Count of Survived Passengers on the Titanic")
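The bar heights in this plot are simple counts, which can also be inspected as numbers. A minimal sketch with made-up survival flags (not the real data):

```r
# Illustrative survival flags: 0 = did not survive, 1 = survived.
survived <- factor(c(0, 1, 1, 0, 0, 0, 1, 0))

counts <- table(survived)        # the counts a bar plot would draw
shares <- prop.table(counts)     # the same counts as proportions
counts
round(shares, 3)
```

On the cleaned data, `table(titanic_data$survived)` would give the numbers behind the two bars.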
Age Distribution Histogram
This histogram shows the distribution of passengers’ ages with bins of 5 years.
ggplot(titanic_data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  xlab("Age") +
  ylab("Count") +
  ggtitle("Age Distribution of Passengers")
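Binning into 5-year intervals can be reproduced by hand with cut(). This is a sketch under stated assumptions: the ages are made up, and the bin edges are placed at multiples of 5, which may not match where ggplot2 places its edges by default (the `boundary` argument of geom_histogram() controls that alignment).

```r
# Illustrative ages (not the real data).
ages <- c(0.42, 4, 5, 22, 24.9, 25, 38, 80)

# Edges at 0, 5, ..., 80; intervals are closed on the right, e.g. (20,25].
breaks <- seq(0, 80, by = 5)
bins <- cut(ages, breaks = breaks, include.lowest = TRUE)

bin_counts <- table(bins)
bin_counts[bin_counts > 0]  # show only the non-empty bins
```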
Boxplot of Age by Survival Status
This boxplot compares the age distribution of passengers who survived and those who did not.
ggplot(titanic_data, aes(x = survived, y = age)) +
  geom_boxplot() +
  xlab("Survived") +
  ylab("Age") +
  ggtitle("Age Distribution by Survival Status")
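The box in a boxplot spans the first to third quartile, with a line at the median. Those summary values can be computed per group directly. A minimal sketch with illustrative values (not the full dataset):

```r
# Illustrative ages and survival flags.
age <- c(22, 38, 26, 35, 35, 54, 2, 27)
survived <- factor(c(0, 1, 1, 1, 0, 0, 0, 1))

# Minimum, quartiles, and maximum of age within each survival group.
five_num <- tapply(age, survived, quantile,
                   probs = c(0, 0.25, 0.5, 0.75, 1))
five_num
```

Comparing the "50%" entries of the two groups gives the same median comparison the boxplot shows visually.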
Violin Plot of Age by Survival Status
This violin plot provides a more detailed look at the age distribution by survival status, including the density of the distribution.
ggplot(titanic_data, aes(x = survived, y = age)) +
  geom_violin() +
  xlab("Survived") +
  ylab("Age") +
  ggtitle("Age Distribution by Survival Status")
Passenger Class Distribution Bar Plot
This bar plot illustrates the distribution of passengers across different classes.
ggplot(titanic_data, aes(x = pclass)) +
  geom_bar(fill = "red") +
  xlab("Passenger Class") +
  ylab("Count") +
  ggtitle("Count of Passengers by Class")
Embarkation Point Distribution Bar Plot
This bar plot shows how many passengers embarked from each point (C = Cherbourg, Q = Queenstown, S = Southampton).
ggplot(titanic_data, aes(x = embarked)) +
  geom_bar(fill = "purple") +
  xlab("Embarkation Point") +
  ylab("Count") +
  ggtitle("Count of Passengers by Embarkation Point")
Scatter Plot of Age vs. Fare
This scatter plot examines the relationship between passengers’ ages and the fares they paid.
ggplot(titanic_data, aes(x = age, y = fare)) +
  geom_point(color = "red") +
  xlab("Age") +
  ylab("Fare") +
  ggtitle("Scatter Plot of Age vs. Fare")
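A correlation coefficient can serve as a rough numeric companion to the scatter plot. The sketch below uses made-up values; the choice of Spearman's rank correlation is an assumption made here because the fare summary above is strongly right-skewed (mean well above the median), which makes a rank-based measure less sensitive to extreme fares.

```r
# Illustrative ages and fares (not the full dataset).
age <- c(22, 38, 26, 35, 54, 2)
fare <- c(7.25, 71.28, 7.92, 53.10, 51.86, 21.08)

# Rank-based correlation: +1 means perfectly increasing, -1 perfectly decreasing.
rho <- cor(age, fare, method = "spearman")
rho
```

On the cleaned data, the equivalent call would be `cor(titanic_data$age, titanic_data$fare, method = "spearman")`.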
Facet Grid Scatter Plot of Age vs. Fare by Survival Status
This plot adds a facet grid to the scatter plot to separate passengers by survival status.
ggplot(titanic_data, aes(x = age, y = fare)) +
  geom_point() +
  facet_grid(. ~ survived) +
  xlab("Age") +
  ylab("Fare") +
  ggtitle("Scatter Plot of Age vs. Fare by Survival Status")
Facet Grid of Age vs. Fare by Passenger Class
This plot shows the relationship between age and fare, with separate facets for each passenger class.
ggplot(titanic_data, aes(x = age, y = fare)) +
  geom_point(color = "green") +
  facet_grid(. ~ pclass) +
  xlab("Age") +
  ylab("Fare") +
  ggtitle("Scatter Plot of Age vs. Fare by Passenger Class")
Combined Scatter Plot of Age vs. Fare by Passenger Class
This plot displays a scatter plot of age versus fare, color-coded by passenger class.
ggplot(titanic_data, aes(x = age, y = fare, color = pclass)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("1" = "red", "2" = "orange", "3" = "green")) +
  xlab("Age") +
  ylab("Fare") +
  ggtitle("Age vs. Fare by Passenger Class") +
  labs(color = "Passenger Class")
Interactive Scatter Plot of Age vs. Fare by Passenger Class and Sex
This interactive plot uses plotly to allow exploration of the data by passenger class and sex.
ggplotly(
  ggplot(titanic_data, aes(x = age, y = fare, color = sex)) +
    geom_point() +
    facet_wrap(~ pclass) +
    xlab("Age") +
    ylab("Fare") +
    ggtitle("Age vs. Fare by Passenger Class and Sex")
)
Stacked Bar Plot of Survival by Passenger Class
This plot visualizes survival proportions within each passenger class using a stacked bar plot. Setting position = "fill" scales every bar to a total height of 1, so the segments show proportions rather than counts.
ggplot(titanic_data, aes(x = pclass, fill = survived)) +
  geom_bar(position = "fill") +
  xlab("Passenger Class") +
  ylab("Proportion") +
  labs(fill = "Survived") +
  ggtitle("Survival Proportions by Passenger Class")
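The proportions drawn by this plot are row-wise shares of a class-by-survival table. A minimal sketch of that calculation, using made-up values rather than the real data:

```r
# Illustrative class and survival values.
pclass <- factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3))
survived <- factor(c(1, 1, 0, 1, 0, 0, 0, 0, 1, 0))

counts <- table(pclass, survived)

# margin = 1 normalizes within each row, so each class sums to 1.
props <- prop.table(counts, margin = 1)
round(props, 3)
```

Each row of `props` corresponds to one bar in the plot, and each column to one colored segment.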
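Interactive Stacked Bar Plot of Survival by Passenger Class
This interactive version of the stacked bar plot uses plotly, showing survival percentages by passenger class with hover information. Because titanic_data has no percentage column, we first compute the share of each survival status within each class.
# Percentage of each survival status within each passenger class
survival_summary <- titanic_data %>%
  count(pclass, survived) %>%
  group_by(pclass) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  ungroup()
plot <- plot_ly(survival_summary,
                x = ~pclass,
                y = ~percentage,
                type = 'bar',
                color = ~survived,
                text = ~paste('Status:', survived, '<br>Percentage:', round(percentage, 2), '%'),
                hoverinfo = 'text',
                textposition = 'auto') %>%
  layout(barmode = 'stack',
         xaxis = list(title = 'Passenger Class'),
         yaxis = list(title = 'Percentage'),
         title = 'Survival Proportions by Passenger Class',
         legend = list(title = list(text = 'Survival Status')))
# Print the object to render the plot
plot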
Conclusion
This document walked through the process of cleaning the Titanic dataset and provided various visualizations to explore the data further. These plots reveal insights into survival rates, age distributions, and the relationships between different variables such as age, fare, and passenger class.